LONDON DATASTORE: Recommendations, workflows and challenges from a data analyst’s perspective

A deep-dive from an academic's, an analyst's and a GIS researcher's perspective, demonstrating common workflows that integrate the London Datastore. The proposed recommendations and modifications for the future London Datastore are designed to enhance the Datastore as the premier choice for analysts seeking up-to-date and precise data on London.
Author

Julian Hoffmann Anton | University College London (UCL)

Published

19-02-2024

Executive Summary

This report presents an extensive analysis of the current London Datastore (LDS - 20.02.2024), examining common use cases and improvements to consider for the future development and expansion of the platform. It offers an in-depth exploration from three distinct viewpoints with common interests: academic research, data analysis, and Geographic Information System (GIS) research. This multifaceted approach demonstrates typical workflows for integrating data from the London Datastore into various fields of study and application. The proposed recommendations and modifications are designed to enhance the Datastore as the premier choice for analysts seeking up-to-date and precise data on London.

The first section revisits the previously set priorities for the future development of the London Data Catalogue (LDC). Following this, we introduce the variety of user tools for data analysis and mapping. We then present an analysis of the LDS's current metadata, acknowledging its complex scope and identifying areas requiring enhancement to improve the user experience and the platform's overall potential. We then weave in observations and recommendations from the workflows of contemporary data analysts, researchers, and academics. This journey, from the initial idea to the final output, leverages the LDS as a central resource, through various data types, access formats, and frameworks. Consequently, we have identified and summarized 15 key observations and recommendations, which are detailed in the following sections.


Access the London Datastore: https://data.london.gov.uk

Key Recommendations

I. Standardization and Efficiency

  1. Encourage data producers to adopt a standardized approach for data sharing formats, ensuring that dataset format is tidy1, consistent, and efficient. [Platform policy & Data inventory]

  2. Emphasize the key geographical datasets and mapping shapes most commonly used for GIS applications in London, greatly facilitating their use. [Design and Data inventory]

II. Metadata Quality and Utility

  1. Enrich the metadata tagging and labeling, while improving the metadata quality and consistency. [Platform Policy & Data inventory]

  2. Correct, expand, and refine metadata tags to enhance their scope and inter-connectivity, fostering the discovery of relevant related data on the platform. [Platform policy & Data inventory]

  3. Improve the visibility and accuracy of the data’s release dates, and ensure clear indicators are present for data updates and changes. [Product functionality & Design]

III. User Interface and Experience

  1. Highlight the platform’s API capabilities and facilitate its use by guiding the user through examples. [Design, One-off product & Product functionality]

  2. Implement versatile sorting features, allowing users to filter datasets by relevance, publication date, and other relevant criteria. [Product functionality]

  3. Always facilitate access to the raw data behind dashboards and reports, for greater analytic freedom and transparency.

  4. Enhance the user experience for data exploration and navigation, allowing for more intuitive searches and the discovery of relevant new and past data. [Product functionality & Design]

  5. Simplify the process for users to find and utilize published analyses and reports, making them more accessible. [Product functionality & Design]

  6. Clearly differentiate among datasets, analytical reports, and tabular data to prevent confusion and streamline user interaction. [Product functionality & Design]

  7. Ensure that file format information is displayed in a clear, concise, and transparent manner, aiding in user comprehension. [Product functionality & Design]

  8. Provide a succinct description for each dataset to convey its contents and purpose at a glance. [Platform policy]

  9. Introduce a better search engine system or Large Language Model chatbot assistance to explore the datasets available. [Product functionality]

IV. Future-proofing and Technology Adaptation

  1. Anticipate and adapt to future data-sharing needs by supporting advanced formats and technologies like point clouds, LiDAR, 3D, APIs, and real-time data streams. [Data inventory & Product functionality]

  2. Gather, analyze, and explore user metrics and their downloads to stay ahead of data demand and learn from users. [Product management]

  3. Prepare a strategy for aging datasets and a growing catalog, with labels, folders, archives or sign-posting. [Product management & Product functionality]

The report breaks down these recommendations, gives more detailed calls to action 🔷, and labels the suggestions by difficulty of implementation 🟩🟨🟥.

Introduction

“What do data analysts want from the data they use and where they access it from? What would a bold and visionary approach to data provision (a new data store) look like?”

There is very often a disconnect between how data custodians or producers structure and provide access to their data and the needs of potential end users. In this report, we will look into different classes of users, each with their own format priorities but also many common interests. The purpose of this study is to inform the London Datastore Team with practical examples of how the platform is used, and how improvements, scaled to all users of the London Datastore, can make a big difference. As data end-users, data analysts, scientists and researchers, how would the recommendations from the report on the London Datastore's Future2 affect and help users? We will work backwards from them and the outputs to the London Datastore and its sources. For that, we also study the metadata of the current LDS data catalog, extracting descriptive statistics and insights. We will also elaborate on the effect of implementing the Data Services Standard identified in the GLA's Data Services Investment Proposal (2021)3 – more specifically, looking at users' needs and problems, how to facilitate access to data, and how some changes will incentivise the use of the platform.

This will lead us to the so-called 'Tidy Data Principles'4 and similar frameworks for optimal dataset formats and efficient standardisation. We will extend the recommendations found at the end of 'Discovering the Future of the London Data Store' with tangible examples showing how they affect the end-user's experience. This will also elaborate on the 'Data Services Investment Proposition (2021)' section on Data Services Standards from an academic's, an analyst's and a researcher's perspective – different users of the London Datastore, but with similar workflows.

1 Background of the report

1.1 Previous research to design the future London Datastore

Our study expands on the findings of extensive research done by GLA and the Open Data Institute. The current improvements identified in the Future of the London Datastore report are the following:

  1. Up to date data
  2. Broader datasets and types of data
  3. Metadata and different views for insights
  4. Improving navigation and search function - including expanding categories and better description of them
  5. Offering signpost and updates indicators
  6. Offering different formats – and allowing interactive visualisation and charts, but also better integration between data and analytical outputs

We will do the exercise of diving deeper into the workflows affected by the points mentioned and elaborate on their impact on analytics projects.

We will include the observations from the "Data Services Investment Proposal (2021) - Appendix B"5 and, at the same time, expand on the proposed "Data Service Standards" concerning the user:

  1. Understand users and their needs. Look at the full context to understand what the user is trying to achieve, not just the part where they interact with the GLA. Understanding as much of the context as possible gives you the best chance of meeting users’ needs in a simple and cost effective way. In a data services context, the core “user” may be a particular type of officer in a London Borough, who has their own particular work context to understand.
  2. Solve a whole problem for users. Work towards creating a service that solves one whole problem for users, collaborating across organisational boundaries where necessary. This may mean aligning with other parts of the GLA or central government to combine data or other supporting services, or simply ensuring that the data service can connect seamlessly with related tasks, for example by ensuring that it uses a common language and that data is exportable in a helpful format.
  3. Provide a joined up experience across all channels [less relevant to data services, as they are generally not consumed e.g. offline]
  4. Make the service simple to use: make sure the service helps the user to do the thing they need to do as simply as possible - so that people succeed first time, with the minimum of help; test for usability frequently with actual and potential users, using appropriate research techniques; design the service to work with a range of devices that reflects users' behaviour - in a data services context, most services may be delivered to officers' laptops and desktops, but should take account of security/firewall limitations, browsers, etc in their corporate environments.
  5. Make sure everyone can use the service: where users are officers, this will mainly mean meeting accessibility standards.

We will proceed with tangible examples from the users' perspective.

2 A modern data analysis & mapping project workflow

This is a hands-on journey from idea to publication, highlighting where the London Datastore (LDS) can support and facilitate most users' workflows. Most research involving data requires a series of tasks, transformations and operations which can be extremely time consuming, especially when aggregated across the whole user base of the London Datastore. The benefits of providing data in the right way will accelerate and multiply the use and impact of the LDS on Londoners.

2.1 Tools for data and maps

Modern data users will most likely combine multiple tools to extract insights out of data. Here is a non-exhaustive list of the most commonly used tools, their importance varies by industry, sector and purpose:

  • MS Excel: A ubiquitous tool for basic data analysis, manipulation, and visualization, especially in business contexts.
  • Python: A versatile programming language with extensive libraries for data analysis (e.g., Pandas, NumPy, SciPy, Matplotlib).
  • R: A programming language and environment focused on statistical computing and graphics, widely used in academia and research. This report and analysis was created purely with R and RStudio.
  • SQL: Essential for managing and querying structured data in relational databases.
  • Tableau: A powerful tool for data visualization, enabling the creation of interactive and shareable dashboards.
  • Power BI: Microsoft’s business analytics service, providing interactive visualizations and business intelligence capabilities.
  • SAS: An integrated software suite for advanced analytics, business intelligence, data management, and predictive analytics.
  • SPSS (IBM SPSS Statistics): Widely used for statistical analysis in social science, it offers a range of analytical tools.
  • Matlab: A high-level language and interactive environment used heavily in engineering and scientific computing.
  • Stata: A tool for data manipulation, visualization, statistics, and automated reporting.

2.2 Modern Geographic Information Science tools (GIS)

  • ArcGIS: A comprehensive GIS software for creating, analyzing, and managing geographic data, widely used in various industries.

  • QGIS (Quantum GIS): An open-source GIS software that supports viewing, editing, and analysis of geospatial data.

  • R (Programming language) and RStudio: Known for its strong statistical capabilities, R has spatial packages like sf and terra (successors to the now-retired rgdal and rgeos), which are used for spatial data analysis and visualization.

  • Python: A versatile programming language with GIS-focused libraries like Geopandas, Shapely, Fiona, and Pyproj, widely used for scripting and automating GIS processes.

  • CARTO, Mapbox Studio, Felt and other online GIS platforms are becoming increasingly popular.
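To illustrate the R spatial stack mentioned above, here is a minimal sketch using the sf package. The square polygon is a purely illustrative stand-in for a borough boundary; a real boundary file (e.g. a GeoJSON or shapefile download) would be loaded with `st_read()`.

```r
library(sf)

# A tiny in-memory square polygon standing in for a borough boundary;
# a real boundary file would be loaded with st_read("<boundary file>")
sq <- st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))))
boroughs <- st_sf(name = "Example borough",
                  geometry = st_sfc(sq, crs = 27700))  # British National Grid

st_area(boroughs)  # area in square metres under the projected CRS
```

Once loaded as an sf object, boundaries can be reprojected with `st_transform()` and joined to tabular Datastore data for mapping.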

Now that we know the analyst's toolkit, we have to find the right data in the right, easy-to-use format.

2.3 Exploring data and finding data as a user

First of all, let's establish the scenarios you, as a user, will encounter.

A. You know what you are looking for:

  1. 🟢🟢🟢🟢 Best case: you find what you want and more than you expected.

  2. 🟢🟢 Great case: you find what you are looking for and nothing else.

  3. 🟢 Good case: you don’t find what you want but find something useful.

  4. 🔴 Bad case: you don’t find what you want and nothing else is useful.

B. You don’t know what you are looking for:

  1. 🟢🟢🟢 Excellent case: you discover something to start a new project or to influence a project.
  2. 🔴 Bad case: the exploration process and the data discovered don't inspire you to start a project or include new data in your projects.

C. You don't know enough:

  1. 🟠 Bad case: you see something you want to use but don't have the capacity to use it because the data access is unclear, too complex or messy.

Any change on the platform should aim to increase the likelihood of the 🟢 scenarios (A1–A3 and B1) and decrease that of the 🔴 (A4, B2) and 🟠 (C1) ones. We will be gathering recommendations for that purpose.

2.4 Empowering the catalog with better metadata

2.4.1 Metadata analysis of the current catalog

Our analysis is done programmatically with R and Quarto, tools which are open-source and free. The code for the graphics and data wrangling is displayed above the charts.

2.4.1.1 Loading all the metadata of the Datastore as of 08.02.2024

The metadata can be accessed as an Excel file, .CSV or .JSON by clicking on Export at the bottom right of the page, or in a more complex way through the API.

Code
# read_excel() requires the readxl package
library(readxl)

metadata <- read_excel("Data/All meta data - export -2024.02.08.xlsx")
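The JSON export parses just as easily. A minimal sketch with jsonlite, using an inline record that mimics the shape of a Datastore metadata entry (the field values here are illustrative, not real catalogue data):

```r
library(jsonlite)

# Inline JSON standing in for the exported metadata file;
# field names mirror the export, values are illustrative only
sample_json <- '[{"title": "Example dataset",
                  "publisher": "GLA",
                  "update_frequency": "Annually"}]'

sample_meta <- fromJSON(sample_json)  # parses into a data frame
nrow(sample_meta)
```

The same `fromJSON()` call works on the downloaded export file path or on the export URL.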

When researchers and analysts first land on a data platform, these common questions arise:

  • Is this really the latest data available?

  • How much of the data is being updated?

  • Are old datasets being refreshed?

  • Are many new datasets appearing regularly?

  • Are these datasets a one-off publication?

  • When is the data release going to happen?

The graphs in the next section provide answers and will give an overview of the update rates and status of datasets.

2.5 The update frequency

What you see in the visualisation are the names of the datasets on the y-axis (every 10th for clarity and simplicity purposes) and on the x-axis, you see the initial upload date 🔴 and the last update date 🟢 of the dataset.

The red vertical line shows the year 2020 and, as of 2024.02.18, 49.7% of the datastore's datasets (586 out of 1177) were last updated before 2020.01.01 🔷. By definition, the older the datastore, the more archival content it will have, so there has to be a rigorous strategy for dealing with historical (and, for some, still relevant) data 🟥. The rate of uploaded datasets might increase or decrease; this could be a key performance indicator (KPI) to track in order to plan accordingly 🟨.

Code
#Convert your string to a POSIXct object, making sure to specify the correct timezone if needed
vertical_line_datetime <- as.POSIXct("2020-01-01", tz = "UTC")

every_nth = function(n) {
  return(function(x) {x[c(TRUE, rep(FALSE, n - 1))]})
}


# Plot with the adjusted geom_hline for the vertical line
ggplot(metadata %>% arrange(desc(updatedAt)),
       aes(x = reorder(title, updatedAt))) +
  geom_segment( aes(y=createdAt, yend=updatedAt, xend=title), color="grey") +
  geom_point( aes(y=updatedAt), color=rgb(0.2,0.7,0.1,0.5), size=1) +
  geom_point( aes(y=createdAt), color=rgb(0.7,0.2,0.1,0.5), size=1 ) +
  coord_flip() +
  geom_hline( yintercept = vertical_line_datetime, col = "red") + # Use geom_hline with yintercept
  theme_minimal() +
  theme(legend.position = "none",
        axis.text.y = element_text(size = rel(0.5))) +
  
  scale_x_discrete(breaks = every_nth(n = 10))+
  xlab("Title (every 10th of 1177)") +
  ylab("Added date and update date")+
  labs(title = "Dataset date and update",
       subtitle = "As of 2024.02.08 there are 1177 datasets")

Code
pre_count<-metadata %>% 
  filter(as.Date(updatedAt)<as.Date("2020-01-01")) %>% 
  summarise(count =n())
tot_count<-metadata %>% 
  summarise(count =n())

pre_count/tot_count
     count
1 0.497876
Code
chart_df <- metadata %>% 
  select(update_frequency) %>% 
  group_by(update_frequency) %>%
  summarise(count =n()) %>% 
  mutate(total = sum(count))

ggplot(data = chart_df, aes(x =count, y = reorder(update_frequency, (count))))+
    geom_bar(stat="identity", fill ="Steelblue") +
    geom_text(aes(label = paste0(" ",count, " - ", round(count/total,2)*100,"%")), hjust =0) +
    theme_minimal()+
  labs(title ="The distribution of the update frequency available for 1177 datasets",
       x= "Number of datasets",
       y="")+
    scale_x_continuous(limits = c(0, max(chart_df$count) * 1.15))  # Expand the limit to 15% more than the max count

2.6 The data publishers

The Greater London Authority (GLA) label is the largest publisher with 478 datasets, equal to 41% of the total, followed by the ONS with 13%. 8% of datasets are from publishers which only published a single dataset. This representation also shows that the GLA in fact has other subcategories of publishers, such as "GLA Opinion research" (4%), which combined represent more than three quarters of all datasets.

Code
chart_df <- metadata %>% 
  select(publisher) %>% 
  group_by(publisher) %>%
  summarise(count =n()) %>% 
  mutate(publisher_2 = ifelse(count == 1, "***Single publication only", as.character(publisher))) %>%
  group_by(publisher_2) %>%
  summarise(summ = sum(count)) %>%
    mutate(total = sum(summ))%>%
  mutate(GLA = (grepl("GLA|Greater London Authority", publisher_2)))

  ggplot(data = chart_df, aes(x =summ, y = reorder(publisher_2, summ)))+
    geom_bar(aes(fill = GLA),stat="identity") +
    geom_text(aes(label = paste0(" ",summ, " - ", round(summ/total,2)*100,"%")), hjust =0) +
    theme_minimal()+
  labs(title ="Publishers of the 1177 datasets",
       x= "Number of datasets",
       y="") +
  scale_fill_manual(values = c("darkred", "darkgreen"))+
  scale_x_continuous(limits = c(0, max(chart_df$summ) * 1.15))  # Expand the limit to 15% more than the max count

2.7 The data authors

Most authors publish only 1 or 2 datasets in the LDS, together accounting for 22% of the catalogue. The label NA appears 14% of the time, indicating missing information.

We also find the Greater London Authority to be the most prolific author of datasets; nevertheless, we also find inconsistent labeling and sub-GLA groups.

Greater London Authority 13%, GLA 5%, Greater London Authority (GLA) 1%, GLA Economics 3%, etc. Other examples of inconsistent labels occur with Transport for London (TfL) or Census data. Correcting these is possible and would make the catalog clearer, more efficient and more transparent 🟩.
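A one-off cleanup of such variants can be sketched as a simple mapping pass. The author strings and canonical labels below are illustrative examples based on the patterns just described, not the full catalogue:

```r
library(dplyr)
library(stringr)

# Illustrative author strings showing the inconsistent variants described above
authors <- c("Greater London Authority", "GLA", "Greater London Authority (GLA)",
             "GLA Economics", "Transport for London", "TfL")

# Map known variants to a canonical label; sub-groups like "GLA Economics"
# are deliberately left untouched here as a policy choice
canonical <- case_when(
  str_detect(authors, "^GLA$|Greater London Authority") ~ "Greater London Authority (GLA)",
  str_detect(authors, "^TfL$|Transport for London")     ~ "Transport for London (TfL)",
  TRUE                                                  ~ authors
)

table(canonical)
```

In practice the variant-to-canonical table would be reviewed by hand once, then applied automatically to new submissions.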

Code
chart_df <-metadata %>% 
  select(author) %>% 
  group_by(author) %>%
  summarise(count =n()) %>% 
  mutate(publisher_2 = ifelse(count <= 2, "***1 or 2 publications", as.character(author))) %>%
  group_by(publisher_2) %>%
  summarise(summ = sum(count))%>%
    mutate(total = sum(summ))%>%
  mutate(GLA = (grepl("GLA|Greater London Authority", publisher_2)))


  ggplot(data =chart_df, aes(x =summ, y = reorder(publisher_2, summ)))+
    geom_bar(aes(fill = GLA),stat="identity") +
    geom_text(aes(label = paste0(" ",summ, " - ", round(summ/total,2)*100,"%")), hjust =0) +
    theme_minimal()+
  labs(title ="Authors of the 1177 datasets",
       x= "Number of datasets",
       y="")+
  scale_x_continuous(limits = c(0, max(chart_df$summ) * 1.15)) + # Expand the limit to 15% more than the max count
  scale_fill_manual(values = c("Steelblue", "Darkred")) 

2.7.0.1 Metadata interconnection

There are currently 18 main tag categories, with many sub-tags. In the following figure we show the co-occurrence of labels across all 1177 datasets.

The diagonal of this table shows the total number of datasets in that category, the largest being “Demographics” with 217 datasets. The other tiles show the co-occurrence of that label with the horizontal topic label (or vertical if you prefer).

The matrix shows that most labels are linked with most other labels, but it might make us wonder whether some of the smaller labels, such as "covid-19", shouldn't be more common than they actually are.

  • Should there have been more datasets with the label “london-2012”? 🔷

  • Should the 18 categories be reevaluated and new labels added? 🔷

  • Should some categories be archived? 🔷

  • Are 18 categories enough?🔷

Code

## Convert the matrix to a data frame in long format
matrix_long <- as.data.frame(co_occurrence_matrix_18) %>%
  rownames_to_column(var = "row") %>%
  pivot_longer(cols = -row, names_to = "column", values_to = "value") %>%
  mutate(row = str_replace_all(row, "topic-", ""))%>%
  mutate(column = str_replace_all(column, "topic-", ""))


order_vector <- metadata %>%
  select(id, starts_with("topic")) %>%
  pivot_longer(cols = starts_with("topic"),names_to ="topic", values_to ="is")%>%
  filter(is == "TRUE") %>%
  group_by(topic) %>% 
  summarise(n=n())%>%
  mutate(topic = str_replace_all(topic, "topic-", "")) %>% arrange(desc(n))

order_vector_1 <- as.character(order_vector$topic)



# Convert 'row' and 'column' to factors with levels specified by 'order_vector'
matrix_long$row <- factor(matrix_long$row, levels = order_vector_1)
matrix_long$column <- factor(matrix_long$column, levels = order_vector_1)

ggplot(matrix_long, aes(x = column, y = row, fill = value)) +
  geom_tile(color = "white") +
  geom_text(aes(label = round(value, 2)), vjust = 1) +
  scale_fill_distiller(palette = "Spectral") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        axis.title = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        axis.ticks = element_blank()) +
  coord_fixed() +
  labs(title ="Matrix of co-occurrence of topics")

2.7.0.2 A Network analysis approach

The labels and tags can be represented as a network, to visualise and calculate centrality measures and let patterns emerge. In the figure below, each major category is a node, and an edge links two nodes when a dataset carries both category labels, growing proportionally wider the more often they co-occur. The larger the node, the more datasets carry that label; the wider the edge, the more connected the two nodes are.
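A sketch of how such a network can be built from a co-occurrence matrix with the igraph package. The 3x3 matrix below is an illustrative stand-in for the 18x18 topic matrix; topic names and counts are invented:

```r
library(igraph)

# A small illustrative co-occurrence matrix standing in for the 18x18 one;
# the diagonal counts datasets per topic, off-diagonal cells count co-occurrences
m <- matrix(c(10, 4, 1,
               4, 8, 2,
               1, 2, 5),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("housing", "transport", "environment"),
                            c("housing", "transport", "environment")))

g <- graph_from_adjacency_matrix(m, mode = "undirected",
                                 weighted = TRUE, diag = FALSE)

V(g)$size  <- diag(m)       # node size ~ datasets per topic
E(g)$width <- E(g)$weight   # edge width ~ co-occurrence count

degree(g)  # how many other topics each topic co-occurs with
plot(g)
```

From the same graph object, centrality measures such as `betweenness(g)` or `strength(g)` fall out directly.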

2.7.0.3 Granular Tag Analysis

There are many more sub-tags beyond the main 18 categories. Without correcting for typos and different spellings, there are 1575 different tags. For the future of the LDS, it is important to re-emphasize and create a new systematic tagging system 🟨 for a richer metadata ecosystem, for data exploration, and for an easier way to search the catalogue. The graph below shows the frequency of each sub-tag; the most common is borough-economics, which appears 83 times, followed by population, which appears 80 times.

I've been told that the labels are more often set by the data creators than by the LDS. An initial correction of typos and a review wouldn't be too time consuming as a one-off measure 🟩, before putting in place a more systematic labeling scheme or guidance for data creators 🟨.

The vast majority of tags appear fewer than 5 times, as seen in the figure below showing the histogram of tag frequencies. More tagging could help link datasets and create clusters beyond the 18 main categories 🟩. Search engines and large language model (LLM) assistants would benefit from it and help users find useful datasets and define new projects for London.
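The tag counts themselves are straightforward to compute once tags are split out per dataset. A sketch assuming the tags arrive as comma-separated strings (the rows below are illustrative, not real catalogue entries):

```r
library(dplyr)
library(tidyr)

# Illustrative comma-separated tag strings; real tags would come from the export
tag_table <- tibble(id = 1:4,
                    tags = c("population,borough-economics",
                             "population",
                             "borough-economics,housing",
                             "population,housing"))

tag_counts <- tag_table %>%
  separate_rows(tags, sep = ",") %>%  # one row per (dataset, tag) pair
  count(tags, sort = TRUE)

tag_counts
```

Feeding `tag_counts` into `ggplot` with `geom_histogram(aes(n))` reproduces the frequency-of-frequencies view described above.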

2.8 Data access and formats

Here is a list of some encountered issues:

  • There isn't a clear distinction between raw tabular data and reports/publications in PDF.

  • Hundreds of PDFs in one place, without a summary or short description.

  • Inconvenient formats: column names are in a separate file from the data.

  • Outdated geographic boundary shapes and files. For London, we had to go directly to the ONS to look for the latest geographic boundaries, which are required for most GIS and mapping work in London. Link to the latest CENSUS geography

  • The flow of dataset exploration is not optimal. Long scrolling is necessary 🔷, without sorting options 🔷 by date or other relevant dimensions. The filtering options 🟨 could also be improved, with better access to tags, keywords and format info.

  • Outdated data links. Crowd-source the update-control process by allowing users to signal to the London Datastore team when a link is broken or a new link or data source is available 🟥. Examples of broken links:

    Broken link to the TfL API. 6

    Broken link to the TfL planner. 7

    Broken link to the TfL feeds.8

  • The store should include and highlight the dataset's date more prominently, or warn that it may be outdated or have a currently broken link 🟨.

  • There aren’t any links to LiDAR datasets, which are public 3D datasets. The store could potentially link to governmental open-access LiDAR sources 🟩 9

  • Simple access to data: dashboards, analytics and reports are great, but leaving easy access to the raw data 🟩 behind the dashboard is key, not only for researchers who want to do more with the data but also for transparency purposes. An example is the Residential-completions-dashboard, labeled as a one-off update (it hasn't been updated in a year), which is a shame for an interactive dashboard, and the data behind it was not available as far as we could tell.

# A tibble: 1 × 1
  sum_of_all
       <dbl>
1       1706

We can look at the frequency of formats on the platform, but here we don't have information about how many files each dataset contains. In fact, there are often multiple files in one dataset. The figure above shows that the most common file format is a Spreadsheet, appearing 677 times. In second place is the .PDF format with 334, and in third place .CSV with 250.

There are also 111 ZIP files, where we most likely don't know what format is inside. This could also be something to clarify 🟩.
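The broken links reported above could also be detected automatically. A minimal sketch of a link checker with httr; the request function is injectable purely so the flagging logic can be exercised without network access:

```r
library(httr)

# Flag a URL as broken when the HEAD request errors out or returns >= 400.
# The fetch argument is injectable so the logic is testable offline.
check_link <- function(url,
                       fetch = function(u) status_code(HEAD(u, timeout(10)))) {
  status <- tryCatch(fetch(url), error = function(e) NA_integer_)
  is.na(status) || status >= 400
}

# check_link("https://data.london.gov.uk")  # FALSE when the link resolves
```

Run periodically over every external link in the metadata export, this would give the Datastore team a standing list of candidates for repair.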

3 APIs and the future of real-time data

The Application Programming Interface (API) pipeline is currently not easily visible, hidden behind the "Guidance" menu. There is also a "Developers" section I was only able to find through Google 🟩. Furthermore, the "Datastore API" page is not linked to the "Developers" page, and neither is clearly visible on the main page of the London Datastore. Combined, they could be put forward as a key asset of the platform for developers and advanced data science users 🟨.

The hidden Developers section simply has 4 links that the user can test to fetch with the API, without any further explanation. The API documentation is present, but there aren't any detailed examples to facilitate its use. Users have to be advanced analysts already familiar with such tools, or spend a significant amount of time discovering the possibilities. 🟨 [Causes user scenario C.1 🟠 - the user doesn't know enough]

The use of the Application Programming Interface (API) could be more prominently highlighted 🟩; this would facilitate the creation of many apps, automated dashboards and other tools that programmatically extract the latest version of a dataset from the LDS. They could also precisely download a series of required, updated datasets always located at the same address, and the catalogue could be browsed through the API.

This opens the gate to many opportunities for the use of the London Datastore and reduces the burden of manually downloading files and going through many time-consuming, repetitive tasks to use the data.

All in all, more guidance and examples with a more prominent presence on the platform will attract more users to more ambitious use-cases of the data🔷.

3.1 Accessing the meta data through the API

Currently there is documentation for the use of the API, but it isn't specific to the LDS and it doesn't explain the possibilities with dataset-based examples. A blog, tutorial 🟥 or article 🟨 presenting how to access the API with Python, R or any other tool would significantly ease the task for any new user.

The code below shows how I accessed the metadata as a .JSON file in R. Many more hours would be needed to truly display all the possibilities empowered by the API, which could be described in a simple yet clear tutorial 🟥.

Code
# Load necessary libraries
library(httr)
library(readr)
library(jsonlite)
library(tidyverse)

# Set API key and endpoint
api_key <- "your_api_key_here"

# The API endpoint from where you want to download the JSON file
api_url_metadata <- "https://data.london.gov.uk/api/datasets/export.json"

# Make the GET request with the API key in headers or query parameters as required
response_meta <- GET(api_url_metadata, add_headers(`Authorization` = paste('Bearer', api_key)))

# Check if the request was successful using the status_code() function
if (status_code(response_meta) == 200) {
  # Extract content from the response object
  content <- content(response_meta, "text", encoding = "UTF-8")
  
  # Parse the JSON content
  json_meta_data <- fromJSON(content)
  
  # Now json_data contains the data as an R object, which you can manipulate or analyze
} else {
  print("Failed to download JSON")
}

3.2 Developer tab: getting the list of datasets

“Developers” page

Code
# The API endpoint from where you want to download the JSON file
api_url <- "https://data.london.gov.uk/api/action/package_list"


# Make the GET request with the API key in headers or query parameters as required
response_list<- GET(api_url, add_headers(`Authorization` = paste('Bearer', api_key)))

# Check if the request was successful using the status_code() function
if (status_code(response_list) == 200) {
  # Extract content from the response object
  content <- content(response_list, "text", encoding = "UTF-8")
  
  # Parse the JSON content
  json_data_list <- fromJSON(content) #%>% flatten()
  
  # Now json_data contains the data as an R object, which you can manipulate or analyze
} else {
  print("Failed to download JSON")
}

3.3 Get a dataset through the API

Ideally, you should be able to easily access any dataset through the address found in the metadata file and use it with the API. Otherwise, you can also manually write the desired URL. The only requirement is to insert "api/" before "dataset" within the URL.

As stated before, a walkthrough document for R or Python could speed up and incentivise the use of the API and the LDS.🟥

Code
# The API endpoint from where you want to download the JSON file
api_url <- "https://data.london.gov.uk/api/dataset/electricity-consumption-borough"

# Alternatively, extract the address from the metadata file's "webpage" column; you still need to insert "/api" before "/dataset" in the URL.

# Make the GET request with the API key in headers or query parameters as required
response <- GET(api_url, add_headers(`Authorization` = paste('Bearer', api_key)))

# Check if the request was successful using the status_code() function
if (status_code(response) == 200) {
  # Extract content from the response object
  content <- content(response, "text", encoding = "UTF-8")
  
  # Parse the JSON content
  json_data <- fromJSON(content)
  
  # Now json_data contains the data as an R object, which you can manipulate or analyze
} else {
  print("Failed to download JSON")
}

3.4 Real-time present and future

As the world continues to increase the frequency and volume of data creation, the London Datastore should prepare for this trend 🔷. As the data update frequency graph in the previous section showed, update rates vary widely: 24% of datasets are updated annually, only 3 (rounding to 0%) are updated in real time, and 27% are updated simply “ad hoc”.

For real-time and more frequently updated data, API access is key.

There are some initiatives such as [to be added].


Other real-time data projects in London could serve as inspiration, such as: [examples to be added]

3.5 Tidy data principles

A notoriously large amount of effort is spent cleaning data to get it ready for analysis; this is a well-known burden in the data science world. As a solution, Hadley Wickham developed a framework that makes it easy to tidy messy datasets.

The definition is simple: each variable is a column, each observation is a row, and each type of observational unit is a table 🔷. Tidy datasets are easy to manipulate, model and visualize. We will show the advantages of a consistent data structure with case studies from the London Datastore, and how everyone could benefit from being freed from mundane and lengthy data manipulation chores.

Tidy Data - Hadley Wickham synthesized and created this efficient framework for data science and data architecture in a now-acclaimed publication in the Journal of Statistical Software10; it is best practice among data scientists and researchers today.
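To make the principle concrete, here is a small hypothetical example (the table and values are invented, not from the LDS) in which one messy column mixes two variables; tidyr's separate() splits it so that each variable gets its own column:

```r
library(tidyr)

# Hypothetical messy table: the "key" column mixes two variables
# (borough and year), a layout often seen in spreadsheet exports.
messy <- data.frame(key   = c("Camden/2021", "Hackney/2021"),
                    value = c(101, 202))

# separate() splits the combined column, giving each variable its own
# column: now each row is one observation, each column one variable.
tidy <- separate(messy, key, into = c("borough", "year"), sep = "/")
```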

3.6 Loading Geographic Data

Geographic data is a major pillar of the London Datastore and should be highlighted further 🔷, given that most of its datasets contain a spatial dimension in the form of coordinates, administrative boundaries or other geographic boundaries.

The LDS should contain and support common GIS file formats such as:

  1. Shapefile (.shp): A vector data format for storing the location, shape, and attributes of geographic features. It is important to note that, inconveniently, this is a composite format made up of several companion files (.shp, .shx, .dbf and others), so it is usually distributed inside a folder or archive.

  2. GeoPackage (.gpkg): A compact, portable format for transferring geospatial information, supporting a wide range of data types without file size limitations.

  3. GeoJSON (.geojson): A lightweight format for encoding a variety of geographic data structures using JavaScript Object Notation.

  4. KML (Keyhole Markup Language) (.kml): An XML notation for expressing geographic annotation and visualization within two-dimensional maps and three-dimensional Earth browsers.

  5. GML (Geography Markup Language) (.gml): An XML grammar for expressing geographical features, serving as a modeling language and interchange format for geographic systems.

  6. GeoTIFF (.tif): A format that allows georeferencing information to be embedded within a TIFF file, widely supported by image-processing and GIS applications.

  7. CSV (Comma-Separated Values) (.csv): A simple format used to store tabular data, including geographic coordinates, easily imported into and exported from GIS software.

  8. XLSX (Microsoft Excel Open XML Spreadsheet) (.xlsx): Can store geographic coordinates and spatial data, often used for data collection and analysis before importing into GIS software.

  9. LAS (LiDAR Aerial Survey) (.las): The standard format for storing LiDAR point cloud data, accommodating a wide array of point data and associated attributes.

  10. LAZ (Compressed LAS) (.laz): A compressed version of the LAS format, using lossless compression to maintain data integrity while significantly reducing file size.
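As a sketch of how such formats are typically consumed in R, the sf package reads most vector formats (Shapefile, GeoPackage, GeoJSON, KML, GML) through GDAL, and a CSV with coordinate columns can be promoted to a spatial layer. The table below is hypothetical:

```r
library(sf)

# Hypothetical CSV-style table with British National Grid coordinates;
# st_as_sf() promotes it to a spatial points layer (EPSG:27700 = OSGB36).
stations <- data.frame(name     = c("A", "B"),
                       easting  = c(530000, 531500),
                       northing = c(180000, 181200))
stations_sf <- st_as_sf(stations,
                        coords = c("easting", "northing"),
                        crs = 27700)

# Vector formats such as GeoPackage or GeoJSON are read the same way
# (file name here is a placeholder):
# boroughs <- st_read("London_Boroughs.gpkg")
```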

The hierarchical representation of UK Statistical Geographies is notoriously complex, as this schema shows. Any help provided for users interested in London could drastically simplify the task.🔷

Making access to the most commonly used shapes easy, simple and frictionless could significantly improve the efficiency and increase the use of the platform.🔷

Another complementary option would be cataloguing, or creating a London-specific list of, important links to the latest relevant GIS resources such as Ordnance Survey (OS)11, the Office for National Statistics (ONS)12 or Census geographies (ONS)13🔷.

For example, the London Borough boundaries are an extremely common GIS file to download. Today, the boundaries are available in the LDS, but were last updated two years ago (while the stated update frequency is monthly). Furthermore, the external link to the supposedly latest version is present but unfortunately broken. Making it possible for LDS users to flag a broken link would be excellent 🟨. Alternatively, creating a dedicated section with common GIS files and links that is regularly checked would be of great help🟨.

A significant number of datasets in the store are not labelled with the granularity of their geographic boundaries, i.e. their spatial resolution. The figure above shows that, excluding unlabelled datasets, the most common geography is Local Authority, with 250 datasets at that scale, followed by Greater London (135) and Borough (72). It would be worthwhile to retrospectively label the 398 unlabelled datasets correctly and to ensure that this labelling system is enforced for future datasets. 🔷

Once the geographic data is downloaded, many common tasks and transformations are needed for analysis. For example, some kind of merging or linking of data to geographic data, or operations like geocoding, are almost certain to be required. We describe such tasks for data analysis in general in the following sections.

3.7 Exploratory analysis and data wrangling

3.7.1 Filter Data

3.7.2 Sub-setting, selecting and filtering the data.

For research and analysis around London, it is often necessary to start from national- or regional-level datasets and then narrow them down to, for example, the 32 Boroughs and the City. Any help, or any national dataset already sub-set and filtered to the Greater London Area and stored in the LDS, would be a great asset and support for analytics.
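A minimal sketch of such a sub-setting step with dplyr, using an invented national table with a "region" column (names and values are hypothetical):

```r
library(dplyr)

# Hypothetical national table with a "region" column; filtering it down
# to Greater London is a one-liner once the labels are consistent.
national <- data.frame(area   = c("Camden", "Manchester", "Hackney"),
                       region = c("London", "North West", "London"),
                       value  = c(10, 20, 30))

london_only <- national %>% filter(region == "London")
```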

3.7.3 Data Transformation (splitting, aggregating, summarising etc.)

[analysis to be continued]

3.7.4 Data linking and merging

[analysis to be continued and examples to be added]
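As a placeholder illustration of the merging step, here is a hypothetical join of an attribute table to borough records by a shared GSS code, the usual key for London geographies; all values are invented:

```r
library(dplyr)

# Hypothetical borough records and a separate attribute table sharing a
# GSS code key; left_join() links them without losing any borough rows.
boroughs <- data.frame(gss_code = c("E09000007", "E09000012"),
                       name     = c("Camden", "Hackney"))
stats    <- data.frame(gss_code   = c("E09000007", "E09000012"),
                       population = c(210000, 260000))

joined <- left_join(boroughs, stats, by = "gss_code")
```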

3.7.5 Data filtering and sorting

[analysis to be continued and examples to be added]

3.7.6 Data correcting / cleaning

[analysis to be continued and examples to be added]

3.7.7 Gap filling / modelling

[analysis to be continued and examples to be added]

3.7.8 Data formatting and pivoting (wide to long / long to wide)

[analysis to be continued and examples to be added] Pivoting between wide and long formats is one of the most common data operations at the start of any analysis, whether in Excel, R or Python.
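A minimal sketch of the wide/long round trip with tidyr, on an invented table:

```r
library(tidyr)

# Hypothetical long-format table: one row per borough per year.
long <- data.frame(borough = c("Camden", "Camden", "Hackney", "Hackney"),
                   year    = c(2021, 2022, 2021, 2022),
                   value   = c(1, 2, 3, 4))

# Long -> wide: one column per year.
wide <- pivot_wider(long, names_from = year, values_from = value)

# Wide -> long: back to one row per observation.
back <- pivot_longer(wide, cols = -borough,
                     names_to = "year", values_to = "value")
```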

3.7.9 Geocoding

[analysis to be continued and examples to be added]

When coordinates are missing or geographic shapes are absent, GIS analysts and researchers need to apply “Geocoding”, which in spatial data science refers to the process of converting addresses or other geographic descriptors into numerical coordinates on the Earth’s surface. Typically, this involves transforming a description like a street address, city name, or postal code into a precise latitude and longitude. These coordinates can then be used for various purposes such as mapping, spatial analysis, and geographic data visualization. [to be continued]
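As a minimal, self-contained sketch of the geocoding idea, the example below joins addresses to a tiny hypothetical postcode lookup table (the coordinates are approximate and for illustration only); in practice a geocoding service or package, for example tidygeocoder or the ONS Postcode Directory, would supply the coordinates:

```r
library(dplyr)

# Hypothetical postcode lookup table; coordinates are approximate and
# for illustration only. A real workflow would query a geocoding service.
lookup <- data.frame(postcode = c("SW1A 1AA", "E1 6AN"),
                     lat = c(51.501, 51.517),
                     lon = c(-0.142, -0.076))

addresses <- data.frame(site     = c("Palace", "Market"),
                        postcode = c("SW1A 1AA", "E1 6AN"))

# Geocoding here reduces to a join on the geographic descriptor.
geocoded <- left_join(addresses, lookup, by = "postcode")
```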

3.7.10 Storing / retrieving / updating

[analysis to be continued and examples to be added]

3.7.11 Data persistence and linking to historical data and updates etc.

[analysis to be continued and examples to be added]

3.7.12 Example: London static map

Combination of GIS files of the London Datastore

Link to geo file: https://data.london.gov.uk/dataset/ultra_low_emissions_zone

[analysis to be continued and examples to be added]

Reading layer `london_boroughs' from data source 
  `C:\Users\julia\Hoffmann Dropbox\Julian Hoffmann\0. Julian Studio\1. Julian Studio Projects\2023.08 - UCL-GLA London Data Store\London Datastore Online Report\Data\London_Boroughs.gpkg' 
  using driver `GPKG'
Simple feature collection with 33 features and 7 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 503568.2 ymin: 155850.8 xmax: 561957.5 ymax: 200933.9
Projected CRS: OSGB36 / British National Grid
Reading layer `InnerUltraLowEmissionZone' from data source 
  `C:\Users\julia\Hoffmann Dropbox\Julian Hoffmann\0. Julian Studio\1. Julian Studio Projects\2023.08 - UCL-GLA London Data Store\London Datastore Online Report\Data\InnerUltraLowEmissionZone.gpkg' 
  using driver `GPKG'
Simple feature collection with 1 feature and 5 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 518493.4 ymin: 172850.4 xmax: 543936.5 ymax: 192597.7
Projected CRS: OSGB36 / British National Grid
Reading layer `LondonWideUltraLowEmissionZone' from data source 
  `C:\Users\julia\Hoffmann Dropbox\Julian Hoffmann\0. Julian Studio\1. Julian Studio Projects\2023.08 - UCL-GLA London Data Store\London Datastore Online Report\Data\LondonWideUltraLowEmissionZone.gpkg' 
  using driver `GPKG'
Simple feature collection with 22 features and 4 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 503899.4 ymin: 156666 xmax: 559664.7 ymax: 200880.3
Projected CRS: OSGB36 / British National Grid
Code
ggplot(data = london_br_sf) +
  geom_sf(col = "black", fill = "steelblue") +
  geom_sf(data = london_ULEZ2021_sf, col = "red", fill = "transparent", inherit.aes = FALSE) +
  geom_sf(data = london_ULEZ_sf, col = "#32CD32", fill = "transparent", inherit.aes = FALSE) +
  theme_minimal() +
  labs(title = "London Boroughs and London ULEZ",
       subtitle = "Inner ULEZ 2021 and London wide ULEZ August 2023 update")

3.7.13 London Interactive Map

Nowadays, if the GIS data is easily accessible in the right format (a GeoPackage, for example), it is straightforward to create interactive maps in R.

Code
library(leaflet)
library(leaflet.providers)

# Create the leaflet map from the layers loaded above
leaflet_map <- leaflet() %>%
  # addProviderTiles(providers$CartoDB.DarkMatter) %>%  # alternative basemap
  addTiles(urlTemplate = "https://{s}.basemaps.cartocdn.com/dark_all/{z}/{x}/{y}.png") %>%
  addPolygons(data = london_br_sf) %>%        # borough boundaries
  addPolygons(data = london_ULEZ2021_sf,
              color = "red",                  # border colour
              fillOpacity = 0.2) %>%
  addPolygons(data = london_ULEZ_sf,
              color = "darkgreen",            # border colour
              fillOpacity = 0.2)

leaflet_map

Conclusions

[Conclusions to be added]

Footnotes

  1. Tidy data format definition: “Tidy datasets are easy to manipulate, model and visualise, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.” https://vita.had.co.nz/papers/tidy-data.pdf↩︎

  2. (2019 - Open Data Institute: https://theodi.org/)↩︎

  3. (See 2021 Report Appendix B/*)↩︎

  4. Tidy Data Principles: https://r4ds.hadley.nz/data-tidy.html↩︎

  5. Data Services Investment Proposal (2021) - Appendix B↩︎

  6. Broken Links to tfl api: https://data.london.gov.uk/dataset/tfl-live-bus-arrivals-api↩︎

  7. https://data.london.gov.uk/dataset/journey-planner-api-beta↩︎

  8. https://tfl.gov.uk/info-for/open-data-users/our-feeds↩︎

  9. Public LiDAR data: https://www.data.gov.uk/dataset/f0db0249-f17b-4036-9e65-309148c97ce4/national-lidar-programme↩︎

  10. Tidy Data: https://www.jstatsoft.org/article/view/v059i10↩︎

  11. Ordnance Survey: https://www.ordnancesurvey.co.uk/products/boundary-line↩︎

  12. ONS Link https://geoportal.statistics.gov.uk/search?q=BDY_ADM&sort=Date%20Created%7Ccreated%7Cdesc↩︎

  13. Census geographies: https://www.ons.gov.uk/methodology/geography/ukgeographies/censusgeographies/census2021geographies↩︎